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^ ! Abstract 

Q , While Spectral Methods have long been used for Principal Com- 

| ponent Analysis, this survey focusses on work over the last 15 years 

, ^/ with three salient features: (i) Spectral methods are useful not only 

for numerical problems, but also discrete optimization problems (Con- 
straint Optimization Problems - CSP's) like the max. cut problem and 

■ similar mathematical considerations underlie both areas, (ii) Spectral 
. methods can be extended to tensors. The theory and algorithms for 

tensors are not as simple/clean as for matrices, but the survey de- 
scribes methods for low-rank approximation which extend to tensors. 

■ These tensor approximations help us solve Max-r-CSP's for r > 2 as 
<^ ', well as numerical tensor problems, (iii) Sampling on the fly plays a 

prominent role in these methods. A primary result is that for any ma- 
trix, a random submatrix of rows/columns picked with probabilities 
^ . proportional to the squared lengths (of rows/columns), yields esti- 

va | mates of the singular values as well as an approximation to the whole 

matrix. 



1 Introduction 

Spectral methods have been widely used in many areas for Numerical prob- 
lems under the name Principal Component Analysis. The algorithmic use 

* kannan@microsoft . com 

t©This is the author's personal version of the work. It is posted here by permission 
of ACM for personal use, not for distribution. The definitive version is published in the 
Proceedings of the ACM Symposium on Theory of Computing, 2010. 



1 



of spectral methods for discrete problems is perhaps more recent. The first 
point of this survey is to suggest that similar mathematical considerations 
motivate both discrete and numerical applications. 

Our second point is the extension of algorithms to tensors. Linear Algebra 
is unique in that it has beautiful theory which also translates to efficient 
as well as optimal algorithms. Tensors admit no such comparable theory 
or algorithms; indeed, some impossibility or hardness results are known for 
tensors. This is not our focus. Instead, we want to see what is algorithmically 
possible with tensors, of course, having to be less ambitious than for matrices. 
[But we seek provable error bounds.] The second purpose of this survey, then, 
is to present methods for matrices which extend to naturally to tensors. Two 
methods surveyed here - the Cut Norm introduced in section [2] and Length- 
Squared sampling procedure introduced in section [5] both have this flavor. 

The third main point of the survey is faster randomized algorithms for 
large matrices (based on sampling) than traditional numerical algorithms; 
here length- squared sampling provides a starting point. Besides improving 
running time through sampling, one operates in a model where the massive 
matrices cannot be stored in Random Access Memory, but must be read and 
sampled on the fly. Several newer sampling procedures are also covered in 
section [7J Many open questions remain both in regard to algorithms and 
computational applications. We will list them in the text, but, here we 
mention two generic challenges to highlight possible directions for research. 

Challenge 1: Find the spectrum of the web graph. The web graph 
(of hypertext links, for example) is a large "naturally occurring" graph. It 
is directed and so the adjacency matrix is not symmetric. The question is 
to find the singular values approximately. We will see that the randomized 
algorithms in the survey could be useful for this. But the provable upper 
bounds on sample size are too large. Can smaller sample sizes suffice ? We 
want a certificate on the error bound and confidence (probability of correct- 
ness) of the spectrum (say at least of the largest several hundred singular 
values) as computed for the particular matrix. [In a sense, this is seeking to 
make a Monte-Carlo algorithm into a Las Vegas one.] The motivation for 
the particular question is: there is much research on statistical properties of 
the web graph. But most, pertain to "local properties" - degree distribution 
(which is local in that we find the degree of one vertex at a time), local com- 
munities (in a neighborhood of one vertex), etc. The top singular value is a 
measure of "global" correlation. Further, the important notion of pagerank 
[BP98] as well as Kleinberg's HITS algorithm |Kle99j of course have links to 
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the spectrum of the graph. 

Challenge II Better provable Algorithms for Tensors 

The second challenge is: while the methods here make a beginning in 
dealing with tensors much research remains to be done on algorithms for 
tensors. We seek better methods for maximizing (approximately) cubic and 
higher order forms (especially when there are unusually good solutions) and 
finding low-rank approximations to tensors faster. 

Some notation For a matrix A, the Frobenius norm ||^4||f is the square 
root of the sum of the squares of its entries. The spectral norm ||A||2 is 
max x: \ x \ = i \ Ax\. The abbreviation u.a.r stands for "uniformly at random", e 
will be a positive error parameter. The material in sections H] onwards (which 
may be read more or less independently of the earlier sections) is covered 
in greater detail in the monograph [K V08] . Spectral Graph Partitioning 
(starting with Fiedler's work |Fie75] ) and many other important topics which 
are well covered elsewhere are not dealt with here. 

2 Approximation of matrices in Cut-norm, 
applications 

Instead of the traditional low-rank approximations to matrices, we start with 
a form more suited to discrete applications. For an m x n matrix, a "rect- 
angle" will mean one of the 2 m+n sets of the form (a subset of rows) x (a 
subset of columns). A "cut matrix" is an m x n matrix with all its entries 
in some rectangle equal and all entries outside this rectangle being zero; it 
is special rank 1 matrix. Our approximation to a matrix is by a sum of cut 
matrices. These approximations have the following desirable properties: 

1. It is very easy to show they exist. 

2. They can be found (and this in non-trivial) in polynomial, in fact in 
constant time (implicitly) with uniform random sample of 0(1) entries 
from the matrix. 

3. Using this, one can solve the maximum cut problem to additive error 
en 2 M on n node graphs with maximum edge weight M. [This also 
extends to all MAX-2-CSP's.] 
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4. Both the existence as well as the algorithmic version (in constant time) 
can be extended to tensors and using this, one can approximately solve 
MAX-r-SAT for fixed r and other problems. 

This is all described in this section. The main caveat for these methods is the 
error of en 2 M. There are many applications where this is not good enough. 
In the discrete optimization setting (say as in Max Cut), if M = 1, the max 
cut needs to be fl(n 2 ) or equivalently, the graph needs to be dense for this 
error to be useful. Of course, many interesting problems are not dense. A 
similar situation holds for PCA as well. Section H] describes a more amenable 
error bound for both these areas and how to achieve it using non-uniform 
sampling and thus address application issues. 

A, B will stand for m x n matrices. If S is a subset of rows and T a subset 
of columns, then, we let 

A(S,T)= J2 A ir 
ieSjeT 

Define the "cut norm" of ||^4||n by 

\\A\\a = MAK s , T \A{3,T)\. 

Lemma 1. Assume < 1. There exist 1/e 2 cut matrices whose sum B 
approximates A in the sense 

\\A — B\\u < emn. 

Proof: If ||^4||n < emn, we can take B = 0. Otherwise, there is some S,T 
such that \A(S, T)\ > emn. Our first cut matrix in B has A(S, T)/\S\\T\ in 
each entry of S x T and zero elsewhere. This subtracts from each G 
S, j G T, their average and it is easy to show that then YlieSjeT A % decreases 
by at least e 2 mn and so does Since || ■ \\p > 0, the process must 

terminate in at most 1/e 2 steps proving the Lemma. 

An immediate question is whether such a B can be found in polynomial 
time. An affirmative answer was given by Frieze and Kannan [F K99] and 
indeed, they proved that it can be found in "constant time" in a sense to 
made clear later. 

Theorem 1. lFK99f For A with \Aij\ < 1 and any fixed e > 0, we can find 
in polynomial time a matrix B which is the sum of 4/e 2 cut matrices and 
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satisfies \\A — B\\q < emn. In fact, given the entries of A in just a u.a.r. 
rectangle of size 0*(l/e 4 ) x 0*(l/e 4 ), we can find an implicit description of 
B. 

From the proof of the Lemma (pQ) , we see that it suffices to determine if the 
cut norm of A is at most emn and if not, find an S, T with \A(S, T)| > emn. 
The exact problem is NP-hard. However, for the theorem an approximate 
version suffices: find the maximum value of \A(S, T) \ to within additive error 
emn. [Really |mn, but redefine e to avoid putting /2 etc.] Reduces to two 
problems: Max A(S, T) and Max —A(S,T). We describe the ideas behind a 
polynomial time algorithm for MaxA(S, T) which we may call 
The Maximum Rectangle Problem: 

• S gives T 

If we know the maximizing S, the T to go with it is just the columns 
of A whose sum in the S rows is positive. 

• Estimate Column sums in S rows 

Pick a subset W of s = 0(l/e 2 ) rows u.a.r. The sum of each column 
in the S rows can be estimated (to additive error emn) by ~ x sum in 
the W fl S rows. 

• Exhaustive Enumeration 

Don't know S or S fl W. But, we can try each subset W of W (there 
are only 2 s of them) as a candidate S fl W. For each, find the set of 
columns- T - whose sum in the W rows is positive. 

• Choose best candidate 

For each candidate T: Let 5" be the rows with positive sum in the T 
columns. Take max A(S', T) among all candidate T. 

The Exhaustive enumeration step is inspired by an idea of Arora, Karger 
and Karpinski |AKK95j . Why does this algorithm work? We supply only a 
brief intuition. In the estimating column sums step, if we had the correct 
S fl W, the only columns on which we could be wrong about the sign of the 
column sum in the S rows are ones where the column sum is close to 0. But 
these do not contribute much to A(S, T) anyway. So, one of our candidate 
T in step 4 is correct in the sense that A(S, T) is high for the true S. The 
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last step finds the best T among the candidates; we didn't need the true 
(unknown) S; the best S for each T is just the S'. 

How do we make this all constant time ? The idea is simple : Pick a 
u.a.r subset S of s = 0*(l/e 4 ) rows and a u.a.r. subset T of s columns 
at the outset. In the above algorithm, instead of finding A(S',T) for each 
of the 2^/ e2 ) candidate T 's, estimate this quantity by ^A(S' f]S,Tn f), 
for which we only need to know the entries of A in S x T. One can show 
using H off ding- Azuma inequality that the estimate is within additive error 
at most O(emn) with high probability. [The failure probability is at most 
e -s/e _ e -i/e f or eac h candidate; so by the union bound, whp, there is no 
failure for any candidate.] Hence the best candidate is found with asserted 
error whp. 

The approximation B has many algorithmic uses. First consider the 
Maximum cut problem in an undirected graph. The problem can obviously 
be written as (with A = the n x n adjacency matrix of the graph) 

MAX x6{0 ,i } n x T A(l-x), 

where x is a column vector and 1 is the vector of all l's. We obviously have 

\x T A(l - x) - x T B(l - X )\<\\A- B\\ D . 

Here, we see why the cut norm is defined the way it is; it is the most natural 
norm for ensuring that for 0-1 vectors x, y, x T Ay x 7 By. [It is close to a 
more traditional "operator norm" - namely, it is easy to show that ||^4||n is 
always within a factor of 4 of MAX| a; | oo= |j / | oo= ia; T A?/.] So to get the maximum 
cut within additive error en 2 , it suffices to solve 

MAX ie{0 , 1}n /5(l-x). 

Since B has constant (depending only on e, not on n) rank, x T B(l — x) is 
determined by a constant number of variables, namely the components of x 
along the space spanned rows/columns of B. Thus, we have reduced the n 
variable problem to one with a constant number of variables. While there 
are many technical difficulties in solving the problem with B, conceptually, 
it is simple to argue that it can be solved in time exponential in the rank of 
B alone by enumeration: Put a fine enough grid in the row/column space of 
B. The number of grid points is exponential only in the dimension of the 
space. For each grid point, the value of x T B(l — x) is determined. It only 



6 



remains to know which grid points correspond to 0-1 x 's. This is an integer 
program; but its relaxation Linear Program turns out to suffice for the error 
we seek. 

Indeed, the attractive feature of this line is that these arguments can 
be extended in a straight forward manner to solve all dense MAX-2-CSP 
problems in polynomial time in a unified manner. Moreover, this also extends 
to approximating tensors in cut norm and that helps us solve all dense MAX- 
r-CSP problems for any fixed arity r. [Recall: In a MAX-r-CSP problem, 
one is given a list of m Boolean functions - fi, f 2 , . . . f m , each a function of 
only r of the variables. We have to find a truth setting of all variables which 
satisfies as many of the m functions as possible. MAX-r-SAT where each 
function is the disjunction of r literals is a central example.] Indeed, one 
gets: 

Theorem 2. Any MAX-r-CSP problem on n variables, where 

r is fixed, can be solved to additive error en r in constant time for fixed e > 0. 
[The running time depends exponentially on 0*(l/e 2 ).] 

For r = 2, the area of Property Testing [GGR98J proved the first such 
results by combinatorial means, often by exploiting the structure of particu- 
lar problems. A flavor of the combinatorial difficulties is seen from the first 
problem so attacked - max.cut. by DelaVega |dlV96j : Akin to the Estimating 
columns sums and Exhaustive Enumeration steps of our algorithm for maxi- 
mizing A(S,T), the property testing algorithms do the following for max.cut: 
clearly in a max cut (S, S), every vertex in S has more edges to S than to 
S. If we just picked a u.a.r. subset W, then try out all possible subsets W 
of W as candidate S D W, we could classify each vertex by whether it has 
higher degree into W or W \ W and hope these would be respectively S and 
S vertices. But this does not work and indeed, the property testing based 
max cut algorithms work hard to fix this. In our setting, S, T are subsets 
of different sets and so are "decoupled" and this is what makes these steps 
work so simply here. 

For general r > 2, Andersson and Engebretsen |AE02] have given a purely 
combinatorial appraoch independently to prove theorem (j2J) too. Our ap- 
proach to proving the theorem is based on an extension of cut matrices and 
norm to tensors which we describe in the next section. 

Another application of the cut norm and approximation is to a version of 
the Szemeredi Regularity Lemma for graphs. For a graph G(V, E) with edge 
weights, we denote by A G the (weighted) adjacency matrix. For two graphs, 



7 



G, G', on the same set of n vertices, we define a distance between them by 
d D (G, G>) = £ max 5 , T cy \A G (S, T) - A G ,(S } T)\. 

From Lemma [U the following Lemma which is often called the Weak 
Regularity Lemma can be proved: 

Lemma 2. [FK99] The vertex set of a graph G(V, E) with all edge weights 
equal to 1 can be partitioned into 2 0{ - l l e > subsets V±, V2, . . . so that the graph 
G' in which for each edge with say, i £ V r ,j £ V s , has weight = 

(number of edges between V r and V s )/\V r \\V s \ has d a (G, G') < e 

Note that G' intuitively behaves like a random graph with edge probabil- 
ities given by edge weights in that the number of edges between subsets of 
vertices would be close to the expected numbers. A constructive version of 
the Szemeredi Regularity Lemma was shown in |ARH + j . A simpler spectral 
algorithm is developed in [FK999J. 

There is an abstract (and improved) version of this lemma due to Tul- 
siani, Trevisan and Vadhan |TTV09] from which they are not only able to 
derive this result, but several others, like the Dense Model theorem of Green, 
Tao and Ziegler |GT08j . A recent result of Bansal and Williams [BW09J 
makes progress on the classical problem of complexity of Boolean Matrix 
multiplication; they use Lemma [2j We will describe in the next section an 
application of weak regularity to graph limits. 
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3 Approximation of Tensors in Cut norm and 
applications 

Recall that a r— tensor is an r dimensional array A with entries Ay&j.... A 
"rectangle" is a set of entries of the form Si x S 2 x . . . S r , where S t is a subset of 
the t th index. [For r = 2, Si is a subset of rows and £2 a subset of columns.] 
A(Sx, S 2 , . . . Sr) is the sum of A 's entries in the rectangle Si X S2 x . . . S r . 
We define the cut norm ||^4||n exactly as for matrices - it is the maximum 
over all rectangles of |A(Si, S 2 , . . . S r )|. A cut tensor has the same entry in 
some rectangle and is zero elsewhere. Lemma 1 carries over with exactly the 
same simple proof as for matrices. 

Lemma 3. [FK99] For a r— tensor A with n\ x n 2 x . . . n r entries, each at 
most 1 in absolute value, there exist 1/e 2 cut tensors whose sum B approxi- 
mates A in the sense 



Interestingly, the constructive version also carries over with one extra 
twist. The idea for solving the maximum rectangle problem is: 

• Want Si, S 2 , . . . S r so that A(Si, S 2 , . . . , S r ) is max. 

• If we knew the maximizing Si, S 2 , . . . S r _i, then the maximizing S r 
consists of the i with 



• We can estimate this by taking random subsets 
Wij W 2 , ■ ■ ■ W r -i, trying out all subsets 

Wi, W 2 , ■ ■ ■ W r _i of the respective W t as candidate S t H W t . 

• This gives us many candidate S r . How do we find the best one ? This 
needs the extra twist: define a r — 1 tensor A by 



Now recursively solve the maximum rectangle problem for the r — 1 
tensor. Then choose the S r with best answer. 



A — B\\ a < enin 2 . . . n r . 



A(S 1 ,S 2 ,...,S r - 1 ,i) > 0. 




9 



The above arguments can be used to show: for a MAX-r-CSP formula 
F(xi, x 2 , ■ ■ ■ , x n ) if we pick a u.a.r. subset Q with \Q\ = q = poly(l/e) 
variables and solve the "induced" MAX-r-CSP problem on the picked 
variables (the induced problem contains only those clauses all of whose literals 
are the picked variables or their negations), then we have whp: 



— MAX(F Q ) - MAX(F) 



< erf 



where Max(F) denotes the maximum number of functions in F which can 
be simultaneously satisfied. [*V is a natural scaling factor. Note that this 
is interesting only when the answer to the whole problem is at least Q(ri r ). 
This holds for "dense" problems where there are Q(n r ) clauses.] The question 
arises: what is the best poly(l/e) in this result? Alon, delaVega, Karpinski 
and Kannan [ADKK02] prove 0*(l/e 4 ) suffices. 

Theorem 3. IADKKO0I Suppose F(x h x 2 , . . . , x n ) is a MAX-r-CSP for- 
mula. If Q is u.a.r. subset of q = 0*(l/e 4 ) of the n variables, then, for 
F®, the induced formula on Q, we have with probability at least 99/100, 



71 

MAX(F) MAX(F Q ) 



< en r . 



There are two parts to the theorem: first asserts that 
^MAX(F Q ) > MAX(F) - m r . This is simple: if one just takes the truth as- 
signment to {xi, X2, ■ ■ ■ , x n } which attains MAX(F), then usual facts about 
sampling can be used to prove that the SAME assignment to the sampled 
variables satisfies (whp) at least ^rMAX^) — eq r of the clauses of F®. The 
other part is the non-trivial one - the reason is we have to rule out ANY as- 
signment to the sampled variables from satisfying too many clauses. Indeed, 
this raises a basic question: 

When can we say for a maximization problem that a sampled induced sub- 
problem gives a good estimate of the answer to the whole problem ? 
The non-trivial part is: show that for small problem (the induced one on 
the sample), no solution gives an unduly high value. [Traditional sampling 
arguments tackle the other part easily.] A situation with a simple answer 
is bounded Linear Programming: it is easy to see that if a system of linear 
inequalities Ax < b; < x^ < 1 in n variables has a solution, then, for a 
random subset Q of the n variables, the induced problem (slightly relaxed) 
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has a solution too: A Q x Q < + 5; < X{ < 1 for i G Q where A Q consists 
of the columns of A corresponding to the Q variables. But the converse is 
also true here using LP duality: if Ax < b has no solution with < Xi < 1, 
then duality tells us that there is one combination of the inequalities which 
has no solution; this is equivalent to the existence of a u > such that 
^2j(u T A)j > u T b. Now for this u, we can show by traditional sampling that 

we have: ^jgq( mT ^)7 > n( uT ^) — ^ demonstrating that there is no solution 
to a slight tightening of the sampled LP. This simple result for LP is used as 
part of the proof of Theorem El 

Another result of a similar flavor about induced subproblems also goes 
into the proof of Theorem [3] and is worth mentioning independently here. 
Suppose A is a large n x n (note: it is square) matrix. If we pick a random 
subset Q of [n] and look at the induced submatrix A® of A on Q x Q, how 
does the cut norm of A® relate to the cut norm of Al It is easy to see that 
||^4 Q ||n > ^||^4||n ~ 8, where 5 is small, since, we could take the subsets 
S, T of [n] which maximize \A(S,T) \ and argue by traditional Statistics that 
\A{S D Q,T n Q)\ > &\A(S,T)\ - 5. The theorem below by Rudelson and 
Vershynin asserts a converse which is harder to prove. It is an improvement 
of a theorem in [ADKK02] and is proved using some Functional Analysis 
techniques. 

Theorem 4. IRV07 1 Let e > and suppose A is an n x n matrix with 
\\A\\p G 0{n) ; \\A\\u € 0(en 2 ) ; \Aij\ < 0(l/e). Then if Q is a u.a.r. subset 
of {1, 2, ... , n} with \Q\ = q G f2(l/e 2 ) and A® is the q x q submatrix of A 
with entries from Q x Q, then 

E\\A% u = 0{eq 2 ). 

Open Question [ADKK02] actually proved such a result for r— tensors 
for any fixed r; but their proof required q G f2*(l/e 4 ). Does a result as above 
with 0(l/e 2 ) hold for r— tensors ? The issue is that the techniques from 
Functional Analysis are no more available for r > 2. (cf. also the next open 
question on approximating the cut norm has this flavor.) 

It is not difficult to show that the problem of finding the cut-norm is 
MAX-SNP hard by a reduction from Max-Cut. Interestingly, using a deep 
result from Mathematics called Grothendik inequality, Alon and Naor |AN06] 
were able to show: 

Theorem 5. lAN06f The cut norm of matrices can be approximated to within 
a factor of 1.782 in polynomial time. 
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Their approach is: the cut norm problem can be reduced to the following 
problem: 

MAX AjjXiUj subject to Xi,yj G {— 1,+1}. 

This Integer Program has a standard Semi-Definite Programming relaxation, 
where the ±1 variables Xi,yj are replaced by vector variables Ui,Vj, required 
to be of length 1: 

MAX Ajj (u; L ■ Vj) subject to |uj| = \vj\ = 1. 

The theorem of Grothendik proves that the optimal value of the SDP is 
at most a factor of 1.782 times the optimal value of the integer program. 
This automatically yields a constant factor approximation to the value of 
the integer program. But, finding a rounding procedure to achieve this was 
both non-trivial and first developed in [A~N06j. 

Open Problem Develop 0(1) factor for cut norms of r-tensors. Note that 
the natural Semi Definite Program for r = 2 does not extend to r = 3. 

In the last section, we saw the weak-regularity Lemma and a notion of dis- 
tance between two graphs. Borgs, Chayes, Lovasz, Sos, Szegedy and Veszter- 
gombi [B CL + 06] defined other interesting notions of graph distances and 
graph limits. They generalize the notion of graph distances to graphs with 
different numbers of vertices. While the definition in Lemma |2] viewed the 
vertex sets of the two graphs having a fixed 1-1 mapping (labeled vertices), 
this is no more possible and relabeling of each vertex set as well as mapping 
one to the other have to be allowed. We do not give the precise definitions 
here. Suppose we do the partition of the vertex set as in Lemma [2j Then 
we could represent each V r by a compound vertex and put an edge between 
compound vertices r, s of weight equal to (number of edges between V r and 
Vs)/|K-||K|- Then the Lemma is really saying that the compressed graph and 
the original are close in some metric. [This intuition needs to be formalized 
into a definition of distance between graphs.] They then use Theorem [3] to 
show: 



Theorem 6. !BC L + 06 l Let G be a simple graph and e, 5 > 0. Then the 



induced sub-graph of G on a random subset of 2 1//5e2 nodes is e close to G 
with probability at least 1 — 5. 

They use this theorem in their extensive work on Graph Limits. Their 
notions facilitate the understanding of very large graphs which can be viewed 
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as limits; but also can be approximated in the above sense by smaller graphs 
(with something akin to "compound vertices".) This raises the following 
somewhat loosely phrased: 

Open Question Can we define a notion of limits for matrices and more 
generally r— tensors and apply these to derive theorems similar to [BCL + 06 

? 

4 Non-Uniform Sampling 

Clearly, uniform sampling of rows/columns will not solve all problems. In- 
deed, if we have a matrix with just one non-zero row and all other rows were 
just O's and we draw uniform sample of rows, we are likely to see only zeros 
and miss the all-important row. Less trivial examples are when only a small 
number of rows contain significantly higher absolute value entries than others. 
From the last sections, we can show that u.a.r. samples yield an approxima- 
tion B to the given matrix A with error in cut norm of at most emnM, where 
M is the max absolute value of an entry of A. It can also be shown that 
with poly(l/e) u.a.r. samples, we can make \\A — B\\^ < e^mnM, the point 
being (briefly) \\A — B\\\j — x T (A — B)y, where \x\ = y/m and \y\ = \/ri, so 
the 1 1^4. — .S 1 1 2 error is 1/y/mn times the \\A — B\\a error. But this amount 
of error is not suitable for many applications. 

A more useful error bound is given in Lemma (jl]), for not only matri- 
ces, but also tensors. Completely analogous to the matrix case, we make 
the following definitions: For an r— tensor A, and r vectors w,x,y, z, . . ., 
A(w, x, y,z,.. .) is defined as J2i,j,k,i,... Ai,j,k,i,...WiXjy k zi . . .. [It is analogous to 
the quadratic form x T Ay = Y2i j AijX^yj for matrices.] The Frobenius norm 
of A, denoted \\A\\ F is again the square root of the sum of squares of the 
entries. The "spectral norm" of A denoted 1 1 ^4 1 1 2 is the maximum over all 
unit length vectors w, x, y, z, . . . of A(w, x, y, z, . . .). A rank-1 r— tensor is 
the outer product of r vectors, denoted w <E> x ®y <8> z . . . whose i, j, k,l, . . .th 
entry is WiX^y^zi .... We say that a tensor has rank at most k if it can be 
expressed as the sum of k rank-1 tensors. 

Lemma 4. IdlVKKVOty For any A, e > 0, there exist a tensor B of rank at 
most 1/e 2 such that 

\\A- B\\ 2 < e\\A\\ F . (1) 
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The simple proof of the Lemma will be given shortly. A polynomial 
time sampling based algorithm is also available for producing such an ap- 
proximation, but the algorithm given by delaVega, Karpinski, Kannan and 
Vempala[dlVK KV05] is non-trivial. 

Theorem 7. ldlVKKV05\/ For any A,e > 0, we can find a tensor B of rank 
at most 4/e 2 in time [n/e) ^ 1 ^ ) such that with probability at least 3/4 we 
have 

\\A- B\\ 2 < e\\A\\ F . 

For matrices, traditional singular value decomposition gives us a polyno- 
mial time algorithm, but, we will see a sampling-based algorithm which in 
essence can be made constant time after 2 passes through the matrix. In 
the case of tensors, no previous polynomial time algorithm was known at 
all. Before giving the proofs/algorithms, we will motivate the error bound of 
| \A — B\ I2 < e\ I A\ \f of ([1]) by three application 

The first motivating area is Principal Component Analysis (PCA). Here, 
one often assumes that the top "few" singular values dominate. (In fact, 
that is in the first place one of the two justifications for making a low rank 
approximation. The other possible motivation for making a low-rank approx- 
imation is "de-noising" - where one assumes that the top few singular value 
components are the real data and the others are possibly noise- for example 
in Latent Semantic Indexing [DFLD 88J.) 

PCA Assumption: The data consists of an m x n matrix A. The top k 
singular values a±, c 2 , . . . contain 1— e of the "spectrum" , where k << m,n. 
More precisely, 

[Strong-PC A]cr 2 + a\ + . . . + a\ > (1 - e)\\A\\ 2 F . 

We need only a weaker version of this: 

[Weak-PC A] a" + a 2 + . . . + a\ > n{\\A\\ 2 F ). 

Under this assumption, ([1]) translates to a "relative error e" . 

A second area is Discrete Optimization. As we saw, the max-cut problem 
can be solved to additive error en 2 M for n— node graphs where the edge 
weights are all at most M. This however is relative error e only in case the 
total of all edge weights is Q(n 2 M); if M < 1, this requires the graph to be 
dense. This raises the question: 
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Can we solve non-dense max cut problem to relative error e ? In general, 
these problems are NP-hard. But an important special case, it turns out 
can be solved in polynomial time - namely when the edge weights satisfy 
the triangle inequality, as was shown using other met hods |dlVKOl] . A uni- 
fied polynomial time algorithm using Theorem [7J for this problem and other 
weighted versions of MAX-2-CSP problems is developed in |dlVKK V05] . [In 
fact, they do this with a weaker condition than triangle inequality for all 
MAX-2-CSP problems.] 

A third area is tensors. Many algorithms are known and used in practice 
for finding low-rank approximations to tensors [Kru89j. But as remarked 
earlier, neither the theory nor the algorithms are anywhere as nice as for 
matrices. There are solid reasons - NP-hardness [Has90j, |HhL09j and non- 
uniqueness/existence. But beyond all this, is a basic question - what is it that 
we can find provably in polynomial time ? Theorem (J7|) seems to be a first 
step. The algorithm for Theorem ([7]) (which we will outline soon) is quite 
different from other known heuristics and draws on new uses of sampling in a 
vein somewhat similar to the maximum rectangle problem. Also, it turns out 
that the error bound in ([T]) suffices to tackle MAX-r-CSP problems where the 
weights satisfy a natural generalization of the triangle inequality to higher 
dimensions. Unweighted dense MAX-r-CSP's are a special case of this. 

Proof of Lemma (jl]): If 1 1 ^4 1 1 2 < e||A||^, then we are done. If not, there 
are w,x,y, z, . . ., all of length 1 such that 

A(w,x,y, z, ...)'> €\\A\\f. Now consider the r— dimensional array 

B = A — (A(w, x, y,z, . . .)w <g> x <g> y <g> z 

[This is of course basically a rank-1 update.] It is easy to see that ||-B|||, = 
\\A\\p — (A(w, x, y, z, . . .) 2 ). We may repeat on B and clearly this process 
will only go on for at most 1/e 2 steps. 

From the proof of the lemma, it is clear that again, the basic algo- 
rithmic question is to find a w, x, y, z, . . . all of unit length, maximizing 
A(w, x, y, z, . . .) to within additive error e||A||ir. We will present the al- 
gorithm for tensors later. First, we will tackle matrices by sampling. 

5 Sampling in large matrices 

Numerical Analysis gives us sophisticated polynomial time algorithms for 
many matrix problems to do with spectral analysis. Here, the focus is on 
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using sampling to solve very large matrix problems approximately. First, 
we look at matrix multiplication. The product of two matrices A, B can be 
written as 

AB = J2^B\ 

i 

where Ai (B 1 respectively) is the i column of A (row of B, respectively). 
An immediate thought is to estimate the sum from a random sample of i 's. 
Consider a random sample of s i 's picked in i.i.d. trials. Let pi,P2, ■ ■ -Pn be 
the probabilities of picking 1,2,. . .n respectively in each trial. If %\, i 2 , ■ ■ ■ , i s 
are the samples, then 



is easily seen to be an unbiased estimator of AB. [I.e., EX = AB entry-wise.] 
We would like to measure the variance, but this quantity depends on which 
entry we are talking about. Here, we make a simple-minded, important 
decision - lets look at the sum of variances of all the entries of X. This 
quantity, which we denote VarAT is seen to satisfy: 



n 

?i|2 



VarX < - V -\Ai\ z \B l 

S ^-^ Vi 

A case of much interest is when B = A T , when this simplifies to 



n 



s S7p 

By Calculus, one can see that this is minimized when the pi are proportional 
to \Ai\ 2 . [Indeed it is not hard to show that these pi are the minimizer of the 
actual variance, not just the upper bound here.] This leads to the following 
probability distribution for sampling the columns of a matrix which turns 
out to have many nice properties: 

Length Squared Sampling : Pick a column with probability pro- 
portional to sum of squares of its entries. 

Length squared sampling was first introduced by Frieze, Kannan and 
Vempala [FKV98] . Its applications to clustering were studied by Drineas, 
Frieze, Kannan, Vempala and Vinay [D FK + 04] . The application to matrix 
multiplication is from [DK01] . See also [DKM06aj . 
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[FKV98J proves that if we draw a sample of columns according to the 
length squared distribution and do an SVD on the sampled columns, this 
gives an low-rank approximation to A with provable error bounds. We state 
this below (without proof). 



Alg 


orithm: Fast-SVD 


1 


Sample s columns of 




A from the squared 




length distribution to 




form a matrix C. 


2 


Find u^\...,u {k) , the 




top k left singular 




vectors of C. 


3 


output J2t = i u(t)u(t)T A 




as a rank-k 




approximation to A. 



The matrix Ylt=i u®u® A is really just the "projection" of A on the 
space spanned by the vS*' and so the theorem below says that A projected 
to the top singular space of C (instead of the usual singular space of A) is 
a good low-rank approximation to A. [A k is the best rank k approximation 
given by SVD.] 

Theorem 8. IFKV 98], ' L F KV0^ The rank-k matrix found by Algorithm Fast- 
SVD (call it A) satisfies: 

E (\\A-A\\ 2 F ) < \\A-A k \\l + 2^-\\A\\ 2 F 

E (\\A-A\\i) < \\A-A k \\ 2 + ^=\\A\\ 2 F . 

In fact the kind of error bound in the theorem is optimal in terms of the 
number of rows sampled; this was shown in |BY03j . 

[FKV98J and [FKV04] in fact apply the sampling once more - to pick a 
sample of rows of C according to the length-squared distribution. Then, it 
turns out that fining the SVD of the const ant- sized matrix (with the sampled 
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rows of C) suffices to give us a low-rank approximation to A. But the proof 
of this is more complicated. The reason is that from the sampled rows of 
C, one gets the right singular vectors of C, but only approximately. The 
error turns out to be bounded by e||C||.p. [It would be better if the error 
bound was relative, in terms of the singular values themselves. But length- 
squared sampling does not give this.] Then for the "low" singular values of 
C (less than e||C||i?), the approximation is no good. So, one has to throw 
out these low ones (these are in a sense "near-singularities") See |DKM06b] 
for a detailed explanation of the method and some improvements. 

An improvement of the error bound, still using length-squared sampling 
was achieved using sophisticated techniques from the field of Probability in 
Banach spaces by Rudelson and Vershynin. Their result stated below picks 
a sample of s rows from A, where s is almost linear in a quantity r, they call 
the numerical rank of A; r = \\A\\ F / \\A\\2- [Recall the PCA assumptions; 
under even the weak PCA assumption, r is 0(1).] Their error bound is also 
better in that it involves IIAIU, rather than IIAIIp. 



Theorem 9. RVOT^ Suppose A is an mx n matrix with numerical rank r 



\\A\\ 2 F /\\A\\l. Let e, 5 E (0,1) and s G VL*{r / e A 5). Let B be a set of s rows of A 
picked in s i.i.d. trials, each according to length- squared and let vS^\ . . . , vP*' 
be the top k right singular vectors of B. Then, A = AY^t=i u ^ u( '^ T satisfies 
the following with probability at least 1 — 2e~ c ^ s : 

\\A-A\\ 2 < a k+1 (A) + e\\A\\ 2 . 

Note also that this is a high probability (with exponential tails) rather 
than just in expectation. 

In another application of length-squared sampling, |SV09] shows that if 
we run an iterative equation solver for an overdetermined system of equations 
with a Kaczmarz iteration (where one uses a violated equation to modify the 
current solution) with the added twist that the violated equation is picked 
according to the length squared distribution, then one gets a guaranteed rate 
of convergence; the reader is referred to the paper for details. 

Length-squared sampling can also be used for tensors and is the basic 
ingredient in the proof of theorem [7J recall that we needed an algorithm to 
find for a tensor A, the maximum value of A(w, x, y,z, . . .) to within e\\A\ \p. 
The idea behind the algorithm for this is to imitate the steps of the algorithm 
for the maximum rectangle problem: 
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1. If we knew the optimizing x,y,z,..., then the optimizing w is easy 
to find: it is just the vector A(-,x, y,z, . . .) (whose i th component is 
A(ei, x, y, z, . . .)) scaled to length 1. 

2. Now, A{e h x, y, z, . . .) = Y^j,k,i,..Ahi,k,i,- x jyk z i ■ ■ •■ Tne sum can be 
estimated by having just a few terms. But, an important question is: 
how do we make sure the variance is not too high, since the entries can 
have disparate values ? 

3. Length squared sampling works ! [Stated here without proof.] 

Achlioptas and McSherry [AM07] developed a different randomized algo- 
rithm for low-rank approximations of matrices - they sample individual en- 
tries independently and show using Random Matrix theory that with those 
on hand, we can get a good approximation to the matrix in spectral norm. 
Their results also have a bearing on "Compressed Sensing" in that they are 
able to infer something about the whole matrix from a random sample of 
entries. 

6 CUR: An interpolative low-rank approxi- 
mation 

We found in the last section an implicit low-rank approximation to A; implicit 
because, the actual approximation needed us to multiply A by the vectors 
u®. In this section, we wish to describe an algorithm to get an explicit 
approximation of any matrix A given just a sample of rows and a sample 
of columns of A. Clearly if the sample is picked according to the uniform 
distribution, this attempt would fail in general. We will see that again the 
length squared distribution comes to our rescue; indeed, we will show that 
if the samples are picked according to the length squared or approximate 
length squared distributions, we can get an approximation for A. Again, this 
will hold for an arbitrary matrix A. 

First suppose A is a m x n matrix and R (R for rows) is a s x n matrix 
constructed by picking s rows of A in i.i.d. samples, each according to ap- 
proximate length-squared distribution. Similarly, let C (for columns) be a 
m x s matrix consisting of columns picked according to the length squared 
distribution on the columns. The motivating question for this section is: 
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Can we get an approximation to A given just C,R7 An affirmative answer 
is given in the theorem below first proved by Drineas and Kannan. Here, 
one does not need the sampling probabilities to be exactly proportional to 
length squared; it suffices to have the probability of drawing column i to be 
at least its length squared / (c.||A||f-), where c is a constant. We call this 
approximate length squared sampling. 

Theorem 10. [DK03], [DKM06c] Suppose C (respectively R) consists of a 
sample of s > fl*(k/e 4 ) columns (respectively rows) of A drawn in s i.i.d. 
trials, each according to (approximate) length squared probabilities. Then, 
from C and R, we can find a s x s matrix U so that 

E\\A-CUR\\ F < \ \A- A k \\ F + e\\A\\ F 
E\\A-CUR\\ 2 < \\A- A k \\ 2 + e\\A\\ F . 

Open Problem Improve the dependence of 1 /e 4 in the theorem. 

The approximation of A by the product CUR is reminiscent of the usual 
PCA approximation based on taking the leading k terms of the SVD de- 
composition. There, instead of C, R, we would have orthonormal matrices 
consisting of the leading singular vectors and instead of U, the diagonal ma- 
trix of singular values. The PCA decomposition of course gives the best 
rank-fc approximation, whereas what the Theorem shows for CUR is only 
that its error is bounded in terms of the best error we can achieve. There 
are two main advantages of CUR over PCA: 

1. CU R can be computed faster from A and also we only need to make two 
passes over A which can be assumed to be stored on external memory. 

2. CUR preserves the sparsity of A - namely C, R are columns and rows 
of A itself. (U is a small matrix since typically s is much smaller than 
m, n). So any further matrix vector products Ax can be approximately 
computed as C{U(Rx)) quickly. 

3. It is an interpolative approximation : unlike SVD, where the singular 
vectors are linear combinations of A's columns/rows, here C, R 

tual columns/rows of A. An application illustrates the point. In doing, 
say, PCA on a "patient-gene" matrix in a Biological application (with 
entry i,j giving the gene expression of gene j for patient i), one gets a 
result which says that "these few linear combinations of genes/patients 
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are important in explaining the data", where the linear combinations 
may involve negative weights as well as positive ones. Instead in CUR, 
we get a collection of individual genes/patients explaining the data, 
arguably providing better intuition. See for example |PMJ + 07] for this 
kind of application. 

The CUR approximation has been extended to tensors as well by Ma- 
honey, Maggioni and Drinesa [MMDJ. There are many improvements and 
applications of CUR - see [MD09J. An important modification is given by 
Sun Xi, Zhang and Falustos [SXZF07J. They make the observation that 
the length squared sampling used in the original CUR algorithm may re- 
sult in a lot of duplicates. They remove duplicates and reweight the unique 
columns/rows remaining by the square root of the number of copies. They 
show that with this new reweighting, one gets good approximations. More 
importantly, they have done empirical studies with Datamining applications 
to show the effectiveness in terms of space and time with the new approxi- 
mation. 

We also briefly describe an application of CUR to Recommendation Sys- 
tems by Drineas, Kerneidis and Raghavan |DKR02j . The central object in a 
Recommendation System is the customer-preference matrix whose th 
entry is the preference of customer i for product j. The basic question is: 
given a small sample of entries from the matrix (collected customer-product 
data) how does one make good recommendations to other customers. The 
idea of using CUR in |DKR02] is a departure from what we have discussed 
so far. So far, we had (somewhere) the whole matrix and used sampling 
to infer its properties, it being too large to deal with in full. Here, even 
collecting the matrix is expensive; we wish to infer its missing entries from 
the sample we know. But note that the techniques of this survey do apply 
to this "reverse engineering" problem. Indeed, [DKR02J prove that under 
some assumptions, we can make good recommendations with just a handful 
of customers' complete preferences (some complete rows) and all customer 
preferences for a handful of products (columns). Much care has to be taken, 
since the sampling cannot be assumed to be according to Length-squared. 

The above raises a more general question. Can we bring costs of accurate 
data collection into our measures of efficiency ? We loosely formulate a 
candidate problem on these lines: 

Open Question Suppose we wish to solve a large Linear Program : 
maxc • x subject to Ax < b. But suppose we have to collect each piece 
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of data - each Aij costs c/e 2 to get with accuracy ±e. Find an efficient 
method for data collection + solution to within error ±<5, where the cost is 
the weighted sum of the data collection cost plus the running time. 

7 Relative Error Approximation 

If Ak is the best rank k approximation (from SVD), then a natural way to 
measure error of any rank k approximation is relative to the "residue" of the 
spectrum, namely relative to \\A — A^Wf- So, we say a rank k approximation 
B makes relative error e if 

\\A-B\\ F < {l + e)\\A-A k \\ F . 

From simple examples, it is easy to see that length-squared sampling does 
not do this in general. 

For multiplicative (1 + e)-approximation, Har-Peled | HP0 5j gave a linear 
time algorithm that requires O(logn) passes over the input matrix. Desh- 
pande and Vempala |DV06] improved this to 0{k) passes using volume sam- 
pling. Both these algorithms use adaptive sampling of |DRVW06] as a sub- 
routine. Drineas, Mahoney and Muthukrishnan |DMM06] gave a different 
algorithm, where the sampling probabilities are computed using SVD, that 
achieves the same approximation ratio but takes more time because of the 
initial computation of SVD. Finally, Sarlos |Sar06j gave a linear time 2-pass 
algorithm for multiplicative (1 + e)-approximation that uses a small number 
of linear combinations of all the rows instead of a subsample; his algorithm 
which we call isotropic random projection is described below. 

First, volume sampling is a generalization of length-squared sampling. 
We pick subsets of k rows instead picking rows one by one. The probability 
that we pick a subset S is proportional to the volume of the fc-simplex A(S) 
spanned by these k rows along with the origin. The raw method will give us 
a factor (k + 1) approximation (in expectation). Incidentally, it also proves 
that any matrix has k rows whose span contains a such an approximation. 
Moreover, this bound is tight, i.e., there exist matrices for which no k rows 
can give a better approximation. 

Lemma 5. [DV06] Let S be a random subset of k rows of a given matrix A 
chosen with probability 

Vol(A(S)) 2 

S Y.T:\T\=k Vol(A(T)) 2 ' 
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Let A be the projection of A to the span of S and let A k be the best rank k 
approximation to A. Then, 

E{\\A-A k \\ 2 F )<(k + l)\\A-A k \\ 2 F . 

More work is needed to convert this to a relative error e approximation 
(which we do not describe here.) 

Isotropic random projection also gives relative error approximations to the 
optimal rank-/c matrix with roughly the same time complexity. Moreover, it 
makes only two passes over the input data. 

The idea behind the algorithm can be understood by going back to the 
matrix multiplication algorithm described in section [5j There to multiply 
two matrices A, B, we picked random columns of A and rows of B and thus 
derived an estimate for AB from these samples. The error bound derived 
was additive and this is unavoidable. Suppose that we first project the rows 
of A randomly to a low-dimensional subspace, i.e., compute AR where R 
is random and n x k, and similarly project the columns of B, then we can 
use the estimate ARR T B. For low-rank approximation, the idea extends 
naturally: first project the rows of A using a random matrix R, then project 
A to the span of the columns of AR (which is low dimensional), and finally 
find the best rank k approximation of this projection. The algorithm is: 

1. Let I = Ck/e and R be a random n x I matrix; compute B = AR. 

2. Project A to the span of the columns of B to get A. 

3. Output A k , the best rank-A; approximation to A. 

Theorem 11. ISarOty Let A be an m x n real matrix with M nonzeros. Let 
< e < 1 and R be an n x I random matrix with i.i.d. Bernoulli entries 
with mean zero and I > Ck/e where C is a universal constant. Then with 
probability at least 3/4 ; 

\\A-A k \\ F < (1 + e)\\A - A k \\ F 

and A k can be computed in two passes over the data in OiMl + (m + n)/ 2 ) 
time using 0((m + n)r 2 ) space. 
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8 Applications to Clustering, Mixtures 



Spectral methods are widely used for clustering and partitioning problems. 
Not many worst-case results have been proved; there are exceptions for spe- 
cial classes of graphs like planer graphs |ST07j . In general, spectral methods 
have been proven to work correctly with high probability under some as- 
sumptions on the generative model of the data. One class of such problems 
is the Planted Problems, where we are given a random graph modified by a 
"planted part" and the objective is to find the planted part. We describe an 
instance of this. 

Consider a graph G which is the union of a purely random graph 
and an unknown clique on vertex set P, where p — \P\ is given. The problem 
is to recover P. If p > c(n logn) 1 / 2 then with high probability, it is easy to 
recover P as the p vertices of largest degree. Alon, Krivelevich and Sudakov 
|AKS98j . using spectral analysis, were able to improve this to p = Vt{n 1 ^ 2 ). 

Let Aq denote the adjacency matrix of G. The spectral approach of 
|AKS98j essentially maximizes x 1 Ax over vectors x with \x\ = 1, expecting 
that the optimal solution is close to u, defined by Ui = p _1 / 2 lj e p, (u is the 
scaled characteristic vector of P) so that we may recover P from the optimal 
solution. 

Frieze and Kannan |FK08] define a natural 3-dimensional array A related 
to the given graph : A^ will be ±1 depending on whether the parity of the 
number of edges among the vertices k is odd or even respectively. They 
show that as long as p G 0(n 1//3 (log n) 4 ), the maximum of the cubic form 
^ijk A ijkXiXjX k as the vector x varies over the unit ball is attained close 
to u, so that if we can find this maximum, then we can recover the clique. 
However, unlike the case of the quadratic form, where the maximization was 
an eignevalue computation which is well-known to be doable in polynomial 
time, there are in general no known polynomial time algorithms for maxi- 
mizing cubic forms. So, the existential result does not automatically lead to 
an algorithm and this is left as an open question. 

Open Question Suppose a n x n x n array A is constructed as above 
from G n i/2 plus a planted clique of size p £ f2(n 1 / 3 (logn) c ). Then can we 
maximize the function J2i j kAijkXiXjXk, \%\ < 1, even within 0(1) factors in 
polynomial time ? 

Brubaker and Vempala [BV09j have generalized this to r tensors, where 
they show that maximizing over an r tensor whose entries are the parity 
of the number of edges in r cliques can find hidden cliques of size n}l r . 
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The computational question of approximately maximizing the r— ary forms 
is open. 

Another class of models is in a sense also planted - there is a hidden 
partition which dictates probabilities. For example, McSherry [McSOlj (fol- 
lowing earlier papers) considers a model in which n objects are divided into 
k clusters (k << n.)Ti, T 2 , . . . T^. There is a number p rs 6 [0,1] which is 
the probability of each edge between a vertex in T r and one in T s . Edges 
are chosen independently and we are given the resulting random graph on 
n vertices. Our job is to find the partition and p TS of the generating model. 
This can be summarized as: we are given a 0-1 matrix A and are to find 
EA, where the expectation is entry-wise. |McS01j shows that under some 
technical conditions, spectral methods will yield the answer. Here is a quick 
idea of the method and its use of the deep results from the theory of ran- 
dom matrices. The matrix A — E A has random independent entries each 
with mean 0. The following celebrated theorem was first stated qualitatively 
by the physicist Wigner and proved by Fiiredi and Komlos (FK81j . See also 
[VuOS] . 

Theorem 12. Suppose A is a symmetric random matrix with independent 
(above-diagonal) entries each with standard deviation at most v and bounded 
in absolute value by 1. Then, with high probability, the largest eigenvalue of 
A — E A is at most cvy/n. 

The strength of this Theorem is seen from the fact that each row of 
A — E A is of length 0(yy/n), so the Theorem asserts that the top eigenvalue 
amounts only to the length of a constant number of rows; i.e., there is almost 
no correlation among the rows (since the top eigenvalue = maxui = i \\{A — 
E A)x\\ and hence the higher the correlation of the rows in some direction 
x, the higher its value). Thus one gets whp an upper bound on the spectral 
norm of A — EA: 

\\A - E A\\ < cuy/n. 

Now, we can (approximately) find EA by doing SVD on A with the help of 
the following simple lemma. 

Lemma 6. [AM01] Suppose A,B are m x n matrices with rank(B) = k. If 
A is the best rank k approximation to A, then 

\\A-B\\ 2 F < 5k\\A-B\\ 2 . 
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While these ideas are clean, it turns out that they only help cluster "most" 
points correctly. The others are corrected in a messy "clean-up" phase. There 
has been progress on clustering in generative models: [AFKMOlj , [DHKS05J , |DHKM 07j . 
But the messiness of the clean-up phase haunts the field and raises the fol- 
lowing: 

Open Problem Clean up the clean-up phase of clustering algorithms for 
generative models (or dispense with it). 

Another well-studied clustering problem has to do with learning mixtures 
of Gaussians and other probability densities. We only describe the part of 
this area which has to do with spectral algorithms. A provable connection to 
spectral method was struck by Vempala and Wang [VW04j . They proved an 
elegant result that given samples from a mixture of k spherical Gaussians, 
the k— dimensional SVD subspace of the matrix whose rows are the samples 
contains all the centers. With the space of centers in hand, one can project 
to that subspace and learn in it. Extensions of this were given in [KSV08J 
and [A~M05j . Two interesting variants of PCA have been proposed recently 
by Brubaker and Vempala - isotropic PCA [BV08] and robust PCA |Bru09] 
which tackle Gaussian mixture learning problems not amenable to standard 
PCA. 

While we do not go into the subject of Spectral Partitioning of graphs, 
we briefly mention that there are many ways to partition nodes given edge 
weights which are to be treated as pairwise similarities between vertices. A 
well-used method is to normalize first each row sum to be 1, then find the 
second largest eignevalue and corresponding eigenvector of the the stochastic 
matrix. Then we partition the vertex set into 2 subsets: those with coordinate 
in the second eigenvector and those with low coordinates. The cut-off can be 
chosen. This algorithm and its variations are widely used |S.\1()() . Not many 
proofs of error bounds are known though. [ST07| prove bounds for planer 
graphs. |KW04| prove that if one repeats this partitioning procedure on the 
subgraphs, then we can ensure that the graph is ultimately split into parts of 
high conductance with not too much edge weight "wasted" between different 
parts. [KM08] prove better bounds for planer graphs. 

Acknowledgements I am grateful to Alan Frieze and Santosh Vempala 
for their collaboration on work reported here and to Santosh also for his 
comments on the manuscript. Thanks to all my coauthors. 
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